Abstract: In higher education institutions such as universities, academic staff are becoming a major asset, and their performance has become a yardstick of university performance. It is therefore important for a university to know the talent of its academicians so that management can plan to enhance academic talent using human resource data. This research aims to develop an academic talent model using data mining based on several related human resource systems. The case study uses 7 human resource systems at one government university in Malaysia. The study shows how an automated human talent data mart is developed to extract the most important attributes of academic talent from 15 different tables covering demographic data, publications, supervision, conferences, research and others. From the talent attributes collected, a forecasting model of academic talent is developed using classification, with 14 classification algorithms in the experiment, for example J48, Random Forest, BayesNet, Multilayer Perceptron and JRip. Several experiments are conducted to obtain the highest accuracy by applying discretization, dividing the data set into different year intervals (1, 2, 3, 4 and no interval) and changing the number of classes from 24 to 6 and 4. The best model achieves 87.47% accuracy using the 4-year-interval data set with 4 classes and the J48 algorithm.

Keywords: Academician database, Classification, 
Data mart, Talent, Forecasting. 

I. INTRODUCTION 

The Department of Human Resource Management (HRM) handles employee-related activities in an organization [1]. Because human resources are limited within a particular organization, they must be managed effectively to help the organization towards excellence. HRM currently has to deal with many challenges, such as globalization, increasing the income of the organization, technology changes, managing intellectual capital and the challenge of change [2]. Among these, intellectual capital is a key challenge: finding, developing and retaining talent is the main concern of human resource executives, according to a study on HRM issues released by Ore Worldwide, based in New York [3].

Only 25 percent of managers were found to practise systematic talent identification. Most errors in measuring talent come from measuring the wrong things, focusing on the whole rather than on a specific talent matrix, analysing summary data when hidden information remains unknown, and not using data to make better decisions [4]. Talent management can be defined as a systematic and dynamic process to identify, develop and retain talent, and its processes depend on how the organization practises them [5].

Methods of forecasting talent for an organization's employees are diverse and mostly still managed informally. In a survey of 250 executive officers directly involved in the talent management of employees, the largest number of respondents relied on the involvement of senior leaders in talent programmes and many used human resource technology, while the rest used performance-based awards, surveys and training of the organization's personnel to identify and develop senior talent [6]. The technologies used include decision support systems [7] [8], data mining [9, 10] and other methods [11]. Employee talent can be predicted from the past information available. The knowledge gained helps HRM manage and choose workers according to performance standards and avoid inconsistency in appointment decisions [12].

This study gives an example of a data preparation technique and explores 15 data sets from a selected university's human resource database. Fourteen classification algorithms are used to build the academic talent forecasting model, and a comparison of the results of these 14 algorithms is presented as a side conclusion of the study.

II. PREDICT ACADEMIC TALENT USING DATA 
MINING 

Talent management in academia is quite lacking compared with talent management in other organizations [13-16], although the number of studies applying data mining in the HRM field is increasing from year to year [17]. Of the 106 studies in the HRM domain from 1990 to 2011, only seven were associated with employee talent [17], and four of those seven came from the same author.

IJORCS (www.ijorcs.org). Mahani Saron, Zulaiha Ali Othman

Academic talent in Malaysian public universities is measured in terms of professional qualifications, awards and recognition, and administration and contribution to the university. This measure of academic talent is identified through a number of criteria based on employee talent determination practices in areas such as Project Leader Assessment, Management & Professional, Academic Workers and other universities, where each practice sets out criteria according to the talent needs of its respective field [12]. In another study, for example, academic talent is measured from educational background, professionalism, age, gender, occupation and level of position [18].

This study explains in detail how the data preparation process and the experiments carried out resulted in a high-accuracy prediction model for academic talent management after passing through several experimental stages. The paper is organized in five sections: the introduction, predicting academic talent using data mining, developing academic talent, the experiment results and the conclusion.

III. DEVELOPING ACADEMIC TALENT 

The methodology of data set preparation in this study is summarized in Figure 1. Data preparation involves collecting 15 sets of raw data from 7 different systems: the personnel information system, conference information system, university performance system, university research system, publication system, awards system and student information system. The next steps are understanding the data, importing it and creating data marts of the selected talent attributes, pre-processing such as cleaning and discretization, and finally splitting the clean data set by year interval and number of classes before experimenting with the classification algorithms.

A. Data Preparation 

The process starts with the collection of 15 sets of raw data from the human resource database of seven systems in one selected university. The ER diagram of the 15 data sets is shown in Figure 2. Demography is the main data set and consists of all lecturer ids and basic profile information. Every other data set is linked to it by lecturer id.



[Figure 1 flow: Data preparation (15 sets of raw data, data understanding, create data mart, pre-processing), leading to the data sets for the experiment (8 data sets: 1, 2, 3 and 4 year intervals, each with 24 classes and 6 classes)]

Figure 1: Data set preparation




Figure 2: ER Diagram Human Resource 

B. Data understanding 

To select meaningful attributes, the pattern of the records must first be understood, for example the distribution of the data received. For each of the 15 data sets, the number of records for every unique value is counted in order to find the meaningful attributes. For example, the publication data set has 3 fields: year of publication, type of publication and writer id. To create the publication-by-year attributes, only the publication years that have records are selected as attributes, and the records are counted using an SQL query. The following example query counts the records for each value of the status field in the Demography data set.

SELECT [Demography].[status],
       COUNT(NZ([Demography].[status])) AS counting
FROM [Demography]
GROUP BY [Demography].[status];
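The value-frequency query above can also be reproduced outside Microsoft Access. The following is a minimal sketch, assuming an in-memory SQLite database in place of Access; the helper name `count_values`, the column `idkey` and the sample rows are hypothetical and are not taken from the study's data. SQLite's `IFNULL` stands in for the Access `NZ` function, so missing values are counted as an empty string.

```python
import sqlite3


def count_values(conn, table, column):
    """Return {value: count} for one column, counting NULL as ''."""
    # Identifiers cannot be bound as SQL parameters, so they are placed in
    # brackets here; only pass trusted table and column names.
    sql = (
        f"SELECT IFNULL([{column}], '') AS value, COUNT(*) AS counting "
        f"FROM [{table}] GROUP BY IFNULL([{column}], '')"
    )
    return dict(conn.execute(sql).fetchall())


# Hypothetical sample rows standing in for the Demography data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Demography (idkey INTEGER, status TEXT)")
conn.executemany(
    "INSERT INTO Demography VALUES (?, ?)",
    [(1, "permanent"), (2, "contract"), (3, "permanent"), (4, None)],
)

status_counts = count_values(conn, "Demography", "status")
print(status_counts)
```

Running the same helper over every field of a data set gives the value distribution that is used to decide which fields are worth turning into attributes.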

From the data understanding analysis, about 1140 attributes were identified as academic talent attributes.

C. Create Data Mart 



A data mart is a smaller-scale data warehouse. The 15 sets of raw data are transferred to a database to create a data mart that eases the search and integration process. This study uses Microsoft Access as the data mart database. For every attribute value that needs to be identified, a query is made against the related data sets using join queries, select queries and other query methods. The query result is then passed back to a program that finds and calculates the attribute values automatically. Figure 3 illustrates this technique.

The programs fall into two categories: those that only search for and update data, and those that generate values by performing calculations. Table 1 lists the names of the stored procedures created to generate the data mart by selecting the meaningful attributes from the tables shown in the ER diagram.

Figure 3: Attribute value update technique

Table 1: Talent attributes



| No | Attributes involved (total of 1140 attributes) | Category | Program name |
|----|------------------------------------------------|----------|--------------|
| 1 | Idkey, year of service, position status, university status, gender, race, current position (as class) | Search / update | Demography |
| 2 | Accumulated publication (1982, 1986-2011) | Calculation | Publication |
| 3 | Number of students supervised (93/94 - 2011/2012), not referring to the student id | Calculation | Studentbysession |
| 4 | Exact number of students supervised (93/94 - 2011/2012), with different student ids | Calculation | Studentsupervised |
| 5 | Date of appointment to DJJK, VK0501 etc | Search / update | Historyposition |
| 6 | Duration from previous position to next position (DJJK, VK0501 etc) | Search / update | Calculationposition |
| 7 | Number of publications by year and type (publication code from 1-23 and year 1982-2011) | Calculation | Publication_type |
| 8 | Performance scores by year (1977, 1988-2010) | Search / update | Performance |
| 9 | Accumulated by position on research (research position codes 1-9) | Calculation | Research |
| 10 | Accumulated by position on research (1 to 9) and by year (2000-2011) | Calculation | Research_basic |
| 11 | Number of positions attending conferences (1-8, A-G) by year (1996-2011) | Calculation | Conference_position |
| 12 | Accumulated number of conferences attended in the International, Departmental, National and University categories | Calculation | Category_conference |
| 13 | Number of awards received by year (1990-2010) and type (Service award, Publication award and Research award) | Calculation | Award |
| 14 | Accumulated administration posts held by position (Associate members, Webmaster etc) | Calculation | Administration |
| 15 | Latest education | Search / update | Education |
| 16 | Amount of grant received by year (1996-2011) | Search / update | Grant |

Sample pseudocode for calculating publications by year, which counts the number of publications that occur in each year:

Sub Publication()
    Declare database
    Declare first Recordset and second Recordset
    Declare CountbyYear variable
    Set starting value of CountbyYear to 0
    Open first Recordset (query data set)
    Open second Recordset (to store attribute value)
    Read first Recordset while not at end
        Loop through second Recordset
            Compare id key of first Recordset with id key of second Recordset
            If equal
                Use a case statement to check the year of publication
                If a record is found, add 1 to CountbyYear
                Update value of CountbyYear in second Recordset
        Loop until end of second Recordset
    Loop until end of first Recordset
    Close first Recordset and second Recordset
    Close database
End Sub
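
The counting logic of the pseudocode above can be sketched in Python as follows. This is an illustration under stated assumptions, not the stored procedure used in the study: the function names, the `pub_<year>` attribute naming and the sample records are hypothetical, and a dictionary lookup replaces the nested loop over two recordsets.

```python
from collections import Counter, defaultdict


def publications_by_year(publications):
    """Count publications per lecturer and year.

    `publications` is an iterable of (idkey, year) pairs, one pair per
    publication record. Returns {idkey: Counter({year: count})}.
    """
    counts = defaultdict(Counter)
    for idkey, year in publications:
        counts[idkey][year] += 1
    return counts


def to_attribute_row(idkey, counts, years):
    """Flatten one lecturer's counts into attributes, using 0 when absent."""
    row = {"idkey": idkey}
    for year in years:
        row[f"pub_{year}"] = counts[idkey][year]
    return row


# Hypothetical publication records: (lecturer id key, year of publication).
records = [(101, 2009), (101, 2009), (101, 2011), (102, 2010)]
counts = publications_by_year(records)
row = to_attribute_row(101, counts, range(2009, 2012))
print(row)
```

Years without any publication appear as 0 in the row, which matches the observation later in the paper that many generated attribute values are zero.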

D. Pre-processing 

Pre-processing is an important step in the data mining process. In this study the majority of the attributes are calculated values, so pre-processing mainly involves filling in missing values for the attributes gender, race and university status. This is done manually by cross-checking other values related to the missing value; for example, a missing race value is cross-checked against names typical of a race and against spouse information.

E. Data set creation

Of the 1140 attributes collected, fewer than 20% of the records have a value greater than or equal to 1. Most attribute values are 0 for attributes from the year 2000 and earlier. No records were removed, because doing so would leave too few attributes and records for modelling. To overcome this problem, this study adds attributes aggregated over intervals of 2, 3 and 4 years for the attributes in categories 2, 3, 4, 7, 10, 11 and 13 of Table 1. The class attribute is also grouped into 24 classes and 6 classes, as shown in Table 2.

The 8 data sets are split as follows:

i. Model set 1 year interval 24 classes: 3220 
records and 1108 attributes. 

ii. Model set 2 year interval 24 classes: 3220 
records and 624 attributes. 

iii. Model set 3 year interval 24 classes: 3220 
records and 459 attributes. 

iv. Model set 4 year interval 24 classes: 3220 
records and 371 attributes. 

v. Model set 1 year interval 6 classes: 3220 
records and 1103 attributes. 

vi. Model set 2 year interval 6 classes: 3220 
records and 609 attributes. 

vii. Model set 3 year interval 6 classes: 3220 
records and 454 attributes. 

viii. Model set 4 year interval 6 classes: 3220 
records and 366 attributes. 
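
The paper does not spell out how the yearly attributes are combined into the 2, 3 and 4 year intervals. The sketch below assumes the simplest reading, namely that yearly counts are summed over consecutive windows; the function name `aggregate_by_interval` and the sample counts are hypothetical.

```python
def aggregate_by_interval(yearly, start, end, width):
    """Sum yearly counts into consecutive windows of `width` years.

    `yearly` maps year -> count; years missing from it count as 0.
    The last window is shorter when the range is not a multiple of `width`.
    """
    result = {}
    for first in range(start, end + 1, width):
        last = min(first + width - 1, end)
        label = f"{first}-{last}"
        result[label] = sum(yearly.get(y, 0) for y in range(first, last + 1))
    return result


# Hypothetical yearly publication counts for one lecturer.
yearly_pubs = {2004: 1, 2006: 2, 2007: 0, 2009: 3, 2011: 1}

four_year = aggregate_by_interval(yearly_pubs, 2004, 2011, 4)
three_year = aggregate_by_interval(yearly_pubs, 2004, 2011, 3)
print(four_year)
print(three_year)
```

Under this assumption, wider windows produce fewer attributes with fewer zero values, which is in line with the attribute counts falling as the interval grows in the list above.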



Table 2: Class attributes

| Description | 24 classes | 6 classes | 4 classes |
|-------------|------------|-----------|-----------|
| LECTURER (JKK) | A | A | A |
| TRAINEE DENTAL LECTURER DUG45 | B | A | A |
| TRAINEE MEDICAL LECTURER DU45 | C | A | A |
| DENTAL LECTURER DUG45 | D | A | A |
| DENTAL LECTURER DUG51 | E | B | B |
| DENTAL LECTURER DUG53 | F | C | C |
| DENTAL LECTURER DUG54 | G | C | C |
| MEDICAL LECTURER DU1 | H | A | C |
| MEDICAL LECTURER DU2 | I | A | A |
| MEDICAL LECTURER DU45 | J | A | A |
| MEDICAL LECTURER DU51 | K | B | B |
| MEDICAL LECTURER DU52 | L | B | B |
| MEDICAL LECTURER DU53 | M | C | C |
| MEDICAL LECTURER DU54 | N | C | C |
| UNIVERSITY LECTURER DS1 | O | A | C |
| UNIVERSITY LECTURER DS2 | P | A | A |
| UNIVERSITY LECTURER DS45 | Q | A | A |
| UNIVERSITY LECTURER DS51 | R | B | B |
| UNIVERSITY LECTURER DS52 | S | B | B |
| UNIVERSITY LECTURER DS53 | T | C | C |
| UNIVERSITY LECTURER DS54 | U | C | C |
| PROFESSOR VK07 | V | D | D |
| PROFESSOR VK06 | W | E | D |
| PROFESSOR VK05 | X | F | D |

A - Represents Lecturer

B - Represents Senior Lecturer

C - Represents Associate Professor

D - Represents Professor VK7

E - Represents Professor VK6

F - Represents Professor VK5
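
The reduction from 6 classes to 4 classes amounts to a simple relabelling in which the three professor grades share one label. A minimal sketch follows, using the class letters from the legend above; the names `SIX_TO_FOUR` and `merge_professor_grades` are hypothetical.

```python
# 6-class labels follow the legend of Table 2. The 4-class scheme merges the
# three professor grades (D, E and F) into the single class D.
SIX_TO_FOUR = {"A": "A", "B": "B", "C": "C", "D": "D", "E": "D", "F": "D"}


def merge_professor_grades(labels):
    """Map a sequence of 6-class labels onto the 4-class scheme."""
    return [SIX_TO_FOUR[label] for label in labels]


merged = merge_professor_grades(["A", "E", "F", "C", "D", "B"])
print(merged)
```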

F. Model Development 

The forecasting model development process is based on the CRISP-DM standard process, which involves choosing the technique, planning the experiment, developing the model and evaluating the model [19]. Figure 4 shows the development stages involved. This study selected 14 classification algorithms from the five main groups of classification algorithms: J48 Decision Tree (C4.5 version 8), REPTree, Decision Stump, Random Forest, Random Tree, BayesNet, Naive Bayes, MultilayerPerceptron, RBFNetwork, K-star, IBk, IB1, JRip and PART, all available in Weka, one of the popular data mining software packages [20]. The discretization technique used in the 1st phase experiment is Entropy-based & MDL stopping criterion, available in Weka. Many classifiers are included in the experiment setting in order to find the one that best fits the task of identifying academic talent.
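
To illustrate the idea behind entropy-based discretization with an MDL stopping criterion, the sketch below finds the single best cut point of a numeric attribute and checks it against the Fayyad and Irani acceptance test. It is not Weka's implementation: the full method applies this step recursively to each resulting interval, and the function names and sample values here are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy, in bits, of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, gain, accepted) for the best binary split.

    `accepted` is True when the information gain exceeds the MDL threshold.
    Returns None when all values are identical and no split is possible.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    all_labels = [label for _, label in pairs]
    base = entropy(all_labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a cut point must separate two different values
        left, right = all_labels[:i], all_labels[i:]
        gain = (base - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if best is None or gain > best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cut, gain, left, right)
    if best is None:
        return None
    cut, gain, left, right = best
    k, k1, k2 = len(set(all_labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * base - k1 * entropy(left) - k2 * entropy(right))
    accepted = gain > (math.log2(n - 1) + delta) / n
    return cut, gain, accepted


# Hypothetical publication counts with the class label of each lecturer.
values = [0, 1, 1, 2, 8, 9, 10, 12]
labels = ["A", "A", "A", "A", "C", "C", "C", "C"]
cut, gain, accepted = best_cut(values, labels)
print(cut, gain, accepted)
```

Because the cut points are chosen with reference to the class labels, this technique is supervised, unlike the equal width and equal frequency techniques used in the second phase.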

The data set that produces the best model in the first phase goes through a 2nd phase of experiments using the discretization techniques available in a different software package, Tanagra. Tanagra is one of the best open source tools in terms of the pre-processing techniques it provides [20]. Three discretization techniques were chosen: Entropy-based & MDL stopping criterion, Equal Width and Equal Frequency. The experiment is repeated with the best algorithm from the earlier phase after each of the 3 discretization techniques is applied to the data set. A comparison is then made to see whether different discretization techniques, and different software for discretization, can produce a better result.
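
The two unsupervised techniques named above can be sketched in a few lines. This is an illustration only and not Tanagra's implementation; the function names and the sample values are hypothetical, and the equal frequency version assigns bins by rank, so tied values can fall into different bins.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so it is clamped to bin k - 1.
    return [min(int((v - lo) / width), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


# Hypothetical skewed attribute values: mostly small, with a few large ones.
values = [0, 0, 1, 2, 3, 10, 20, 40]
width_bins = equal_width_bins(values, 4)
freq_bins = equal_frequency_bins(values, 4)
print(width_bins)
print(freq_bins)
```

On this skewed sample, equal width places five of the eight values in the first bin, while equal frequency spreads them evenly, which shows how differently the two techniques treat attributes dominated by small values.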

In the 3rd phase, the data set from the best model undergoes the final phase of refining the model. The professor posts VK7, VK6 and VK5 are merged together as shown in Table 2, reducing the number of classes from six to four for further experiments. Three data sets are used: the best data set with 4 classes, the data set without year interval and 6 classes, and the data set without year interval and 4 classes.

The 2nd phase experiment, which continued with the J48 algorithm on the data set that gave the best result in the 1st phase, did not show a significant improvement from applying different discretization techniques. In the 3rd and last phase, the 4-year-interval data set with the number of classes reduced to 4 gave the best result, 87.47%, and was finally accepted as the academic talent model.



Figure 4: Model development step 



IV. EXPERIMENT RESULT 




[Chart legend: 1 Year Interval, 2 Year Interval, 3 Year Interval, 4 Year Interval]



Table 3 shows the overall experiment results. In the 1st phase, the best result is J48 (86.96%) with the 4-year-interval, 6-class data set. The discretization applied in the 1st phase did not make a significant difference to the results of the 14 algorithms. It can also be concluded that, besides J48, Multilayer Perceptron, JRip and PART can also produce good predictions of academic talent, as shown in Figure 5.

Figure 5: Pattern result from 14 classification algorithms

This shows that the J48 decision tree still gives the best result compared with the other classifiers in the talent domain, in line with the previous research mentioned in this paper. A total of 117 rules were successfully extracted from the best model for the Lecturer, Senior Lecturer, Associate Professor and Professor talent model.

Table 3: Experiment results

| Phase | Algorithm / Discretization / Data set | Result (%) |
|-------|---------------------------------------|------------|
| 1st phase | J48 | 86.96 |
| | REPTree | 83.33 |
| | Random Forest | 78.73 |
| | Decision Stump | 53.26 |
| | Random Tree | 65.84 |
| | BayesNet | 68.83 |
| | Naive Bayes | 73.91 |
| | Kstar | 58.70 |
| | IB1 | 67.86 |
| | IBk | 68.32 |
| | JRip | 85.51 |
| | PART | 85.71 |
| | Multilayer Perceptron | 81.10 |
| | RBFNetwork | 72.21 |
| 2nd phase (J48 on the data set with the best result, 86.96) | Entropy-based & MDL stopping criterion | 86.96 |
| | Equal width | 79.50 |
| | Equal frequency | 71.27 |
| 3rd phase (J48 on the same data with different classes and year intervals; best earlier result 86.96) | 4 year interval, 4 classes | 87.47 |
| | No year interval, 6 classes | 86.34 |
| | No year interval, 4 classes | 87.37 |

V. CONCLUSIONS

This paper has presented sample pseudocode for developing an automated academic talent data mart. The pseudocode can be implemented as a stored procedure to generate the latest data mart, from which the latest academic talent model can be generated. The experiment results show that J48 outperformed the other classification techniques among the 14 tested. The results also show that applying discretization does not significantly improve the result, whereas changing the number of classes and arranging the data in various year intervals does lead to better results. This paper contributes by showing how to develop an up-to-date and accurate academic talent management model using data mining, and which data mining techniques and improvements obtain better results. The accuracy is still considered low, possibly because of unbalanced data, in which many zero values result from the automatic generation using the pseudocode.